winsorize

2022-03-15 · 1 min read

Source: https://www.wikiwand.com/en/Winsorizing

Cheap way to remove outliers from samples.

  • TL;DR: winsorize(X, q in [0,0.5]) replaces all elements in X below quantile q and above quantile 1-q with the quantile q and 1-q respectively.
  • Just clamps outliers above and below a quantile.
  • Effectively removes outliers without changing the number of samples (length of X).
  • Used in cargo's default bench harness when taking timing samples.

Example Implementation #

Source: https://github.com/phlip9/idris2-aoc21/blob/master/src/Bench.idr

import Data.List
import Data.List1
import Data.String
import Data.Vec
import Scratch
import System.Clock
import System.Random

%default total

||| Clamp `x` to the range `[min, max]`. Assumes `min <= max`.
clamp : Ord a => (min : a) -> (max : a) -> (x : a) -> a
clamp min max x = if x > max then max
                  else if x < min then min
                  else x

||| Linearly interpolate between `x0` and `x1` according to the scale parameter `t`.
||| For example, `lerp` returns `x0` when `t = 0` and `x1` when `t = 1`.
lerp : Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerp x0 x1 t = (1 - t) * x0 + (t * x1)

||| Linearly interpolate between `x0` and `x1` then clamp in the range `[x0, x1]`.
||| Useful for handling scale parameters `t ∉ [0, 1]`
lerpClamp : Ord a => Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerpClamp x0 x1 t = clamp x0 x1 (lerp x0 x1 t)

||| Truncate a float number returning the whole part. Truncate works
||| correctly on negative numbers, unlike `floor`.
||| For example, `trunc 3.58 == 3.0` and `trunc (-4.9) == (-4.0)`
trunc : F64 -> F64
trunc x = cast {to=F64} $ cast {to=Integer} x

||| Return the fractional part of a float number.
||| For example, `frac 3.58 == 0.58` and `fract (-4.9) == (-0.9)`
fract : F64 -> F64
fract x = x - trunc x

data Interval a
  = OpenAbove a -- [a ..)
  | OpenBelow a -- (.. a]
  | Closed a a -- [a .. b]

||| Return the fractional index of a list as an `Interval` of the adjacent integer
||| number entries. A negative index will result in an `OpenBelow` interval while
||| an index greater than the largest integer index will result in an `OpenAbove`
||| interval.
fractIndex : (xs : List1 a) -> (idx : F64) -> Interval a
fractIndex (x ::: xs) idx = if idx < 0.0 then OpenBelow x
                            else go x xs (cast idx)
where
  go : a -> List a -> Nat -> Interval a
  go x [] idx = OpenAbove x
  go x (y :: _) 0 = Closed x y
  go _ (y :: ys) (S idx) = go y ys idx

||| Return the empirical quantile, interpolating between the two adjacent integer
||| number entries.
||| Assumes `xs` is sorted. Handles `q ∉ [0, 1]` by returning the min or max
||| respectively.
empiricalQuantile : (xs : List1 F64) -> (q : F64) -> F64
empiricalQuantile xs q
  = let n = cast {to=F64} $ length xs
        idx = (n * q) - 1
        t = fract idx in
        case fractIndex xs idx of
             OpenAbove x => x
             OpenBelow x => x
             Closed x y => lerpClamp x y t

||| Replace elements below/above `q`/`1-q` quantile with the `q`/`1-q` quantile
||| respectively. Reduces impact of outliers without changing the number of samples.
|||
||| Note: assumes xs is sorted
||| See: https://en.wikipedia.org/wiki/Winsorising
||| 
||| ### Example
|||
||| ```idris2
||| > winsorize [1.0,2,3,4,5,6,7,8,9,10] 0.25
||| [2.5 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 7.5, 7.5]
||| ```
winsorize : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorize xs q
  = let q = clamp 0.0 1.0 q
        q = if q > 0.5 then 1.0 - q else q
        lo = empiricalQuantile xs q
        hi = empiricalQuantile xs (1.0 - q) in
        map (clamp lo hi) xs

winsorizeUnsorted : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorizeUnsorted xs q = winsorize (sort xs) q